This is a course that should help me work better with data analysis in the future. I already love markdown :). I have R knowledge, but I never completed a course in the language, so sometimes I find I am doing very simple things the hardest way imaginable. I hope this course will close these gaps in my knowledge.
The crash course is nice, as I’m not that used to the tidyverse syntax and was taught to use base R (subsets, slices and such). I use git for my work, so it’s not very convenient for me to start using PATs, but I checked, and it indeed works. I just prefer to do it through gpg.
Here’s the link to the GitHub repository.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 5 04:01:28 2023"
The text continues here.
Describe the work you have done this week and summarize your learning.
After understanding the metadata file and skipping the unnecessary Finnish-language instructions, it was easy to prepare the data file. The script can be found here. I used plyr and tidyverse to convert the raw data into a workable file.
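The actual conversion lives in the linked script; as a rough sketch of the kind of operation involved (the toy data frame and the column names q1–q4 below are hypothetical stand-ins, not the real survey items):

```r
library(dplyr)

# Hypothetical raw answers; the real data has many Likert-scale question columns
raw <- data.frame(q1 = c(4, 2), q2 = c(5, 3), q3 = c(3, 1), q4 = c(4, 2))

# Combination variables such as "deep" are means over a group of related questions
combined <- raw %>% mutate(deep = rowMeans(across(q1:q4)))
combined$deep  # 4 2
```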
The dataset is the result of a survey on teaching and learning that was conducted as a research project in 2014. The dataset was loaded in the next chunk and the dimensions and structure of the document are output.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
learning2014_csv <- read_csv("/home/alex/ods_course/IODS-project_R/data/learning2014_AA.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014_csv)
## [1] 166 7
str(learning2014_csv)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The data frame contains 166 rows and 7 columns. The first column encodes gender and has the character type. The rest are all numeric and contain the data that will be used for the regression analysis. The “attitude”, “deep”, “stra”, and “surf” columns contain combination scores computed from the original data frame. The full metadata explanation can be found here. The explanation of the study can be found here.
Briefly, the numeric variables summarise as follows:
summaries <- learning2014_csv %>%
summarise(across(where(is.numeric), list(mean = ~mean(.), sd = ~sd(.), min = ~min(.), max = ~max(.), median = ~median(.))))
print(t(summaries))
## [,1]
## age_mean 25.5120482
## age_sd 7.7660785
## age_min 17.0000000
## age_max 55.0000000
## age_median 22.0000000
## attitude_mean 3.1427711
## attitude_sd 0.7299079
## attitude_min 1.4000000
## attitude_max 5.0000000
## attitude_median 3.2000000
## deep_mean 3.6797189
## deep_sd 0.5541369
## deep_min 1.5833333
## deep_max 4.9166667
## deep_median 3.6666667
## stra_mean 3.1212349
## stra_sd 0.7718318
## stra_min 1.2500000
## stra_max 5.0000000
## stra_median 3.1875000
## surf_mean 2.7871486
## surf_sd 0.5288405
## surf_min 1.5833333
## surf_max 4.3333333
## surf_median 2.8333333
## points_mean 22.7168675
## points_sd 5.8948836
## points_min 7.0000000
## points_max 33.0000000
## points_median 23.0000000
ggpairs(learning2014_csv, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
These results make sense at first glance.
I have chosen three variables to examine: attitude, age, and surf.
linear_modelling_for_learning2014_csv <- lm(points ~ surf + attitude + age, data = learning2014_csv)
summary(linear_modelling_for_learning2014_csv)
##
## Call:
## lm(formula = points ~ surf + attitude + age, data = learning2014_csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.973 -3.430 0.167 3.997 10.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.85628 3.53786 4.765 4.18e-06 ***
## surf -0.96049 0.79943 -1.201 0.231
## attitude 3.42388 0.57353 5.970 1.45e-08 ***
## age -0.08713 0.05361 -1.625 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared: 0.2082, Adjusted R-squared: 0.1935
## F-statistic: 14.2 on 3 and 162 DF, p-value: 2.921e-08
Only one explanatory variable showed a statistically significant association with the points outcome, so only the attitude variable will be kept for the next modelling step.
linear_modelling_for_learning2014_csv_pruned <-lm(points ~ attitude, data = learning2014_csv)
summary(linear_modelling_for_learning2014_csv_pruned)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014_csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The second model is simpler but has not lost much explanatory power:
- The p-value is comparably low, much lower than 0.001, meaning there is strong support for rejecting the hypothesis that the two variables are unrelated.
- The estimate shows how the two variables are connected: for a one-unit change in attitude, the predicted points change by 3.5255.
The adjusted R-squared of 0.1856 is relatively low, indicating that the model may not be very effective at predicting or explaining the dependent variable. This might be due to various reasons, such as missing important predictors, non-linear relationships, or high levels of noise in the data.
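As a worked example of what the pruned model says, plugging the coefficients printed in the summary above into the regression equation:

```r
# Coefficients copied from the model summary above
intercept <- 11.6372
slope     <- 3.5255

# Predicted exam points for a student with attitude score 3
predicted_points <- intercept + slope * 3
predicted_points  # 22.2137
```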
plot(linear_modelling_for_learning2014_csv_pruned, which=c(1,2,5))
Using these plots (residuals vs. fitted values, the normal Q-Q plot, and residuals vs. leverage, selected with which = c(1, 2, 5)) we can investigate the assumptions of the model.
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)
if (!require(gt)) {
  install.packages("gt")
}
## Loading required package: gt
library(gt)
The data frames were merged and modified according to the instructions. The resulting data frame contains 370 observations of 35 columns. The R script used is here.
aa_alc <- read_csv("/home/alex/ods_course/IODS-project_R/data/aa_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(aa_alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime : num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities : chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ average_alc: num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. average_alc = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(aa_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "average_alc" "high_use"
The data describe student achievement in secondary education at two Portuguese schools and contain 370 observations of 35 variables. The attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. The different features can be used to analyse the dataset and make predictions. More detailed information can be found here.
“school”: The name or type of school the student attends.
“sex”: The student’s gender.
“age”: The student’s age.
“address”: The student’s home address or type of locality (urban/rural).
“famsize”: The size of the student’s family.
“Pstatus”: The marital status of the student’s parents.
“Medu”: The mother’s education level.
“Fedu”: The father’s education level.
“Mjob”: The mother’s job.
“Fjob”: The father’s job.
“reason”: The reason for choosing this school.
“guardian”: The student’s primary guardian.
“traveltime”: Time taken to travel to school.
“studytime”: Time spent on studying outside of school.
“schoolsup”: Whether the student receives school support.
“famsup”: Whether the student receives family support.
“activities”: Participation in extracurricular activities.
“nursery”: Whether the student attended nursery school.
“higher”: Aspirations for higher education.
“internet”: Access to the internet at home.
“romantic”: Involvement in a romantic relationship.
“famrel”: Quality of family relationships.
“freetime”: Amount of free time after school.
“goout”: Frequency of going out with friends.
“Dalc”: Daily alcohol consumption.
“Walc”: Weekly alcohol consumption.
“health”: General health status.
“failures”: Number of past class failures.
“paid”: Whether the student is enrolled in extra paid classes.
“absences”: Number of school absences.
“G1”: Grade in the first period.
“G2”: Grade in the second period.
“G3”: Final grade.
“average_alc”: Average alcohol consumption (perhaps a computed variable from Dalc and Walc).
“high_use”: Indicator of high alcohol use (likely a derived variable).
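The two derived variables at the end of the list can be reproduced from Dalc and Walc. A sketch of the apparent definition (average of the weekday and weekend scores, with high use meaning an average above 2; the exact threshold is inferred from the printed rows, not stated in the data):

```r
# First three students' values, copied from the printed data above
Dalc <- c(1, 1, 2)  # workday alcohol consumption
Walc <- c(1, 1, 3)  # weekend alcohol consumption

average_alc <- (Dalc + Walc) / 2  # 1.0 1.0 2.5
high_use    <- average_alc > 2    # FALSE FALSE TRUE
```

This matches the first rows of the str() output above (average_alc 1, 1, 2.5 and high_use FALSE, FALSE, TRUE).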
I decided to look at “failures”, “absences”, “freetime” and “famrel”. I would expect failures and absences to be higher, and the quality of family relationships to be lower, in the high-consumption group; for freetime I had no strong expectation going in.
summaries <- aa_alc %>% group_by(high_use) %>% select(.,c("failures", "absences", "freetime", "famrel","high_use")) %>%
summarise(across(where(is.numeric), list(Average = ~mean(.), sd = ~sd(.))))
gt(summaries) %>% cols_align("left") %>% opt_stylize(color="gray", style=3) %>% tab_header("Table for chosen variables")
| Table for chosen variables | ||||||||
| high_use | failures_Average | failures_sd | absences_Average | absences_sd | freetime_Average | freetime_sd | famrel_Average | famrel_sd |
|---|---|---|---|---|---|---|---|---|
| FALSE | 0.1196911 | 0.4370868 | 3.710425 | 4.464017 | 3.119691 | 0.9869096 | 4.007722 | 0.8935266 |
| TRUE | 0.3513514 | 0.7464906 | 6.378378 | 7.060845 | 3.468468 | 0.9421428 | 3.765766 | 0.9337603 |
These are the averages and standard deviations of all the variables, to better see which ones are worth a closer look.
First, looking at the overall structure: is there something connecting the chosen variables?
ggpairs(select(aa_alc,c("failures", "absences", "freetime", "famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
As it turns out, free time is strongly correlated with family relations, which I did not expect.
ggpairs(select(aa_alc,c("failures","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The number of failures is slightly higher in the high-consumption group, but not by much. The overwhelming majority of participants in each group have 0 failures. This was expected.
ggpairs(select(aa_alc,c("absences","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The number of absences in the high use groups is higher, as expected. The standard deviation is also much higher in this group, suggesting more variability in the absences.
ggpairs(select(aa_alc,c("freetime","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The freetime variable has a different mean but a very similar sd; the high-consumption group has a slightly higher average. I was curious about this relation, and the result is interesting.
ggpairs(select(aa_alc,c("famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The family relation variable seems to be lower in the high consumption group, as expected.
first_log_regression <- glm(high_use ~ failures + absences + freetime + famrel, data = aa_alc, family = "binomial")
# create model summary and extract the coeffs
summary(first_log_regression)
##
## Call:
## glm(formula = high_use ~ failures + absences + freetime + famrel,
## family = "binomial", data = aa_alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55560 0.63703 -2.442 0.014608 *
## failures 0.58297 0.20259 2.878 0.004007 **
## absences 0.08375 0.02230 3.755 0.000173 ***
## freetime 0.43091 0.12804 3.365 0.000764 ***
## famrel -0.31981 0.13098 -2.442 0.014622 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 408.17 on 365 degrees of freedom
## AIC: 418.17
##
## Number of Fisher Scoring iterations: 4
From the model we can see the following:
coeffs <- summary(first_log_regression) %>% coefficients()
coeffs
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55560199 0.63703172 -2.441954 0.0146080178
## failures 0.58296988 0.20258843 2.877607 0.0040070412
## absences 0.08375499 0.02230317 3.755295 0.0001731376
## freetime 0.43090906 0.12804024 3.365419 0.0007642749
## famrel -0.31981102 0.13098388 -2.441606 0.0146221015
OddsRatio <- exp(coeffs)
confidence <- confint(first_log_regression)
## Waiting for profiling to be done...
result <- cbind(OddsRatio, confidence)
result
## Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## (Intercept) 0.2110623 1.890860 0.08699073 1.014715 -2.8317664 -0.32620631
## failures 1.7913506 1.224568 17.77169287 1.004015 0.1917587 0.99258448
## absences 1.0873624 1.022554 42.74681512 1.000173 0.0420769 0.13014453
## freetime 1.5386556 1.136599 28.94562433 1.000765 0.1840339 0.68715409
## famrel 0.7262863 1.139949 0.08702100 1.014730 -0.5789180 -0.06341917
Here we can see the same: the first three predictors have a positive association with consumption, while the family-relations variable has a negative association. The strongest connection is with failures, then freetime and famrel, while the per-unit link to absences is weak.
It is important to keep in mind that this is logistic regression, not the previously investigated linear regression. That means the estimates in this table represent odds ratios and can be thought of in terms of likelihood: for example, in our case an increase in failures by one unit multiplies the odds of high alcohol consumption by about 1.8.
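One caveat about the combined table above: the estimate column was exponentiated, but the confidence-interval columns were left on the log-odds scale (hence the negative bounds). To express both on the odds-ratio scale, the interval bounds should be exponentiated as well. A sketch using the failures row copied from the output:

```r
# failures coefficient and its profile-likelihood CI, copied from the output above
log_odds <- 0.58296988
ci       <- c(0.1917587, 0.99258448)

# Exponentiating puts the estimate and the interval on the same odds-ratio scale
odds_ratio <- exp(log_odds)  # ~1.79: odds multiply by ~1.8 per extra failure
odds_ci    <- exp(ci)        # both bounds above 1, consistent with significance
```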
The results are in agreement with the hypotheses I started with. Interestingly there seems to be a connection between the free time and the consumption levels, which were not obvious when just looking at the data.
We can test the predictive power of the model created earlier to see whether it can actually predict alcohol consumption from the four chosen variables. For each student we can predict the probability of increased consumption.
aa_alc$predicted_probabilities <- predict(first_log_regression, type = "response")
aa_alc <- mutate(aa_alc,prediction = predicted_probabilities > 0.5)
table(Actual = aa_alc$high_use, Predicted = aa_alc$prediction)
## Predicted
## Actual FALSE TRUE
## FALSE 236 23
## TRUE 85 26
These are the 2x2 cross tabulation count results.
((table(Actual_percentage = aa_alc$high_use, Predicted_percentage = aa_alc$prediction))/length(aa_alc$prediction)) * 100
## Predicted_percentage
## Actual_percentage FALSE TRUE
## FALSE 63.783784 6.216216
## TRUE 22.972973 7.027027
The False outcome was correctly predicted for 236 participants (63.8%), and True was correctly predicted for 26 (7.0%).
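The standard classification metrics follow directly from the counts in the cross-tabulation above:

```r
# Counts copied from the cross-tabulation above
tn <- 236; fp <- 23  # actual FALSE: predicted FALSE / TRUE
fn <- 85;  tp <- 26  # actual TRUE:  predicted FALSE / TRUE
n  <- tn + fp + fn + tp  # 370 students

accuracy    <- (tn + tp) / n   # ~0.708
error_rate  <- 1 - accuracy    # ~0.292
sensitivity <- tp / (tp + fn)  # ~0.234: most high users are missed
specificity <- tn / (tn + fp)  # ~0.911
```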
library(ggplot2)
ggplot(aa_alc, aes(x=high_use, y=prediction)) +
geom_jitter(color="blue") +
theme_minimal() +
labs(title="Actual vs Predicted Alcohol Consumption")
confusion_matrix <- table(Predicted = aa_alc$prediction,Actual = aa_alc$high_use)
training_error <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
training_error
## [1] 0.2918919
The total training error was 29%.
We can compare our model to a simple guessing strategy: always predicting the most common class.
First we determine which is the bigger group:
sum(aa_alc$high_use)/length(aa_alc$high_use)
## [1] 0.3
High consumers make up 30% of participants, so the more prevalent group is the low-consuming one.
simple_guess_accuracy <- mean(aa_alc$high_use == F)
model_accuracy <- mean(aa_alc$high_use == aa_alc$prediction)
simple_guess_accuracy
## [1] 0.7
model_accuracy
## [1] 0.7081081
The accuracy of the model is marginally better than guessing.
We can perform a cross-validation of the model, meaning that we will test the model’s performance on subsets of the same data to determine how accurately the chosen model can work in practice. For that we will be using the cv.glm function from the boot library. The idea is to test how well the model predicts the status on ten different subsets of the data in sequence.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
library(boot)
cv <- cv.glm(data = aa_alc, cost = loss_func, glmfit = first_log_regression, K = 10)
cv$delta[1]
## [1] 0.2945946
The prediction error of my model is about 0.29 compared to the 0.26, which is worse. I used 4 predictors and did not include the sex predictor in my model, which might be important. Additionally, two of my chosen factors were quite strongly correlated, so one of them might not add much to the model.
We can try using all the predictors to see how the number of predictors influences the model parameters. For this we exclude all the predictors that were used to create the outcome variable.
First we load additional libraries.
if (!require(caret)) {
  install.packages("caret")
}
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(caret)
if (!require(glmnet)) {
  install.packages("glmnet")
}
## Loading required package: glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(glmnet)
predictors <- setdiff(colnames(aa_alc), c("Dalc","Walc","average_alc","high_use","predicted_probabilities","prediction" ))
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()
aa_alc$high_use <- as.factor(aa_alc$high_use)
# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
# Select a subset of predictors
current_predictors <- predictors[1:i]
formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
set.seed(123) # for reproducibility
fitControl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
# Store errors: model$results$Accuracy is the cross-validated accuracy,
# so 1 - accuracy gives the misclassification error
training_errors <- c(training_errors, 1 - mean(model$results$Accuracy))
testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))
number_of_predictors <- c(number_of_predictors, i)
model_summaries[[i]] <- model$coefnames
}
results_df <- data.frame(
Number_of_Predictors = number_of_predictors,
Training_Error = training_errors,
Testing_Error = testing_errors
)
# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
geom_line(aes(y = Training_Error, colour = "Training Error")) +
geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
labs(title = "Training and Testing Errors by Number of Predictors",
x = "Number of Predictors",
y = "Error") +
theme_minimal()
This shows that increasing the number of predictors makes the model worse; however, there is an increase in model performance at the 26th point in the plot. If we look at the predictor first present at the 26th point:
setdiff(model_summaries[[26]],model_summaries[[25]])
## [1] "failures"
As we know, failures was a strong predictor. So, to check how the plot looks if we start with the better predictors, I reversed the predictors vector.
predictors <- rev(predictors)
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()
# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
# Select a subset of predictors
current_predictors <- predictors[1:i]
formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
set.seed(123) # for reproducibility
fitControl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
# Store errors: model$results$Accuracy is the cross-validated accuracy,
# so 1 - accuracy gives the misclassification error
training_errors <- c(training_errors, 1 - mean(model$results$Accuracy))
testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))
number_of_predictors <- c(number_of_predictors, i)
model_summaries[[i]] <- model$coefnames
}
results_df <- data.frame(
Number_of_Predictors = number_of_predictors,
Training_Error = training_errors,
Testing_Error = testing_errors
)
# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
geom_line(aes(y = Training_Error, colour = "Training Error")) +
geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
labs(title = "Training and Testing Errors by Number of Predictors",
x = "Number of Predictors",
y = "Error") +
theme_minimal()
That makes the model better overall, with the minimum error value being:
## [1] 0.2454599
This minimum is reached with the addition of the guardian variable.
predictors[20]
## [1] "guardian"
This shows that simply increasing the number of predictors negatively influences the model, but adding valuable predictors improves it.
library(MASS) # For the Boston dataset
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2) # For graphical representations
library(caret) # For data splitting and preprocessing
library(corrplot)
## corrplot 0.92 loaded
# Step 1: Load and Explore the Boston Dataset
data("Boston")
# Exploring the structure and dimensions of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This dataset contains housing values in the suburbs of Boston. It has 506 rows and 14 columns, all numerical; chas can be considered categorical, as it can only be 0 or 1. Each row represents a different suburb. The columns are various features such as the crime rate, number of rooms, age of the housing, etc. A more complete description can be found here.
The variables have the following summary statistics:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The statistics for the different variables are not directly comparable, as is expected for real-world data. We can now examine the relations between the variables to look deeper into the data.
The distributions are skewed: nox and dis are skewed toward low values, age toward high values. The proportion of non-retail business acres per town (indus) has a distribution with two peaks, as does the tax variable.
Almost all the variables appear to be correlated. Only 8 pairs are not significantly correlated, all involving the “categorical” Charles River variable (chas). Most of the correlations make sense: e.g. nox concentration and the distance to the city are negatively correlated, while nox and industrialisation are positively correlated.
The crime rate appears to have a statistically significant correlation with all variables except chas. It correlates negatively with 5 variables and positively with 7.
We can plot the correlations in a more visually pleasing way; all the relations between the variables can be seen clearly in this figure.
cor_matrix <- cor(Boston)
corrplot(cor_matrix)
As the data describe very different phenomena, the values are not directly comparable: e.g., the mean value for tax was 408, while for nox it was ~0.55. To eliminate that, we can standardise the data.
scaled_Boston <- as.data.frame(scale(Boston))
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Now all the means are the same: 0. This standardization makes variables more comparable and often improves machine-learning model performance.
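A minimal check that scale() does exactly the centering and dividing described here, standardising one Boston column by hand and comparing:

```r
library(MASS)  # for the Boston dataset

# Standardising by hand: subtract the column mean, divide by the standard deviation
manual_z <- (Boston$nox - mean(Boston$nox)) / sd(Boston$nox)
auto_z   <- as.vector(scale(Boston$nox))

all.equal(manual_z, auto_z)  # TRUE: scale() is (x - mean) / sd, column-wise
```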
Using the quantiles as breakpoints, creating a categorical crime variable, and removing the old continuous one:
quantiles <- quantile(scaled_Boston[, "crim"], probs = c(0, 0.25, 0.5, 0.75, 1))
scaled_Boston$categorical_crim <- cut(scaled_Boston[, "crim"], breaks = quantiles,
include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_Boston <- scaled_Boston[, -which(names(scaled_Boston) == "crim")]
Splitting the Dataset into Train and Test Sets
80% of the data is now in the training set, and the remaining 20% is in the test set.
library(caret) # createDataPartition() comes from caret
set.seed(123) # For reproducibility
indexes <- createDataPartition(scaled_Boston$categorical_crim, p = 0.8, list = FALSE)
train_set <- scaled_Boston[indexes, ]
test_set <- scaled_Boston[-indexes, ]
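For reference, a plain base-R 80/20 split can be sketched as below; unlike `caret::createDataPartition()` it does not stratify by the class variable. Shown on the built-in mtcars data for self-containment:

```r
# Simple random 80/20 split without caret (no stratification)
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
nrow(train)  # 26 of the 32 rows
```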
# Loading necessary library
library(MASS) # For lda function
library(ggplot2) # For creating biplot
# Step 1: Fit the Linear Discriminant Analysis Model
# Fitting LDA with categorical crime rate as target variable
lda_model <- lda(categorical_crim ~ ., data = train_set)
# Step 2: Summary of the LDA Model
lda_model
## Call:
## lda(categorical_crim ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2512315 0.2487685 0.2487685 0.2512315
##
## Group means:
## zn indus chas nox rm age
## low 0.9279155 -0.9002730 -0.19513102 -0.8740789 0.4510628 -0.8867213
## med_low -0.0913696 -0.2911617 -0.03844192 -0.5807849 -0.1223160 -0.3793710
## med_high -0.3836558 0.1931104 0.15646403 0.3930228 0.1140281 0.4230461
## high -0.4872402 1.0149946 -0.07933396 1.0662955 -0.4222547 0.8161999
## dis rad tax ptratio black lstat
## low 0.8428080 -0.6789912 -0.7285408 -0.39200564 0.37567764 -0.77093834
## med_low 0.3597389 -0.5361295 -0.4615109 -0.03982856 0.31663833 -0.16048504
## med_high -0.3650188 -0.4542586 -0.3302709 -0.28998920 0.08434782 -0.06173942
## high -0.8559422 1.6596029 1.5294129 0.80577843 -0.76539336 0.90035435
## medv
## low 0.5324073
## med_low -0.0140094
## med_high 0.1977447
## high -0.7285395
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11143638 0.643005769 -0.9943789
## indus 0.08740548 -0.227273609 0.2772096
## chas 0.01766274 -0.058628846 0.1539449
## nox 0.18483474 -0.897262214 -1.3248989
## rm -0.01332317 -0.033607221 -0.1228014
## age 0.25569474 -0.405630630 -0.1938632
## dis -0.12730753 -0.346775059 0.0996160
## rad 4.07862012 0.908418329 -0.1089458
## tax 0.06001905 0.008407111 0.5321260
## ptratio 0.15504095 0.072786263 -0.3434405
## black -0.10715634 0.034155460 0.1240672
## lstat 0.14772261 -0.131485744 0.4522005
## medv 0.04936251 -0.364822504 -0.2088124
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9637 0.0274 0.0088
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# The biplot visualizes the linear discriminants (LD1 and LD2) and shows
# how the observations in the training set are separated based on the
# categorical crime rate.
classes <- as.numeric(train_set$categorical_crim)
plot(lda_model, dimen = 2, col = classes,pch = classes)
lda.arrows(lda_model, myscale = 2)
LD1 is influenced mostly by rad; LD2 is influenced by zn and nox with similar magnitude, but in opposite directions.
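That reading comes from the coefficient table: sorting the absolute values of a discriminant's coefficients shows which variables dominate it. A sketch on the built-in iris data (not the Boston model above):

```r
library(MASS)
# Which variables dominate LD1? Sort absolute coefficients, largest first.
fit <- lda(Species ~ ., data = iris)
sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE)
```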
# Step 6: Save Crime Categories and Update Test Set
# Saving the crime categories from the test set
test_crime_categories <- test_set$categorical_crim
# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "categorical_crim")]
# Step 7: LDA Model Prediction
# Fit LDA model on the training set
library(MASS) # LDA is in the MASS package
fit_lda <- lda(categorical_crim ~ ., data = train_set)
# Predicting on the test set
predictions <- predict(fit_lda, newdata = test_set)
Cross-tabulating the predictions against the actual categories with table() and addmargins():
table(Predicted = predictions$class, Actual = test_crime_categories) %>%
addmargins()
## Actual
## Predicted low med_low med_high high Sum
## low 17 3 1 0 21
## med_low 8 15 4 0 27
## med_high 0 7 17 1 25
## high 0 0 3 24 27
## Sum 25 25 25 25 100
The model is reasonably good, especially at the high and low values, and can be used to differentiate high and low groups. The middle groups become muddled.
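A quick overall accuracy for the table above (values copied from the printed confusion matrix):

```r
# Confusion matrix from the output above; rows = predicted, cols = actual
conf <- matrix(c(17,  3,  1,  0,
                  8, 15,  4,  0,
                  0,  7, 17,  1,
                  0,  0,  3, 24), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # overall accuracy: 0.73
```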
Reloading the dataset, scaling and calculating the distances:
data(Boston)
Boston_scaled <- data.frame(scale(Boston))
dist_euc <- dist(Boston_scaled)
summary(dist_euc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
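`dist()` returns the lower triangle of pairwise Euclidean distances between rows; a toy check on the classic 3-4-5 triangle:

```r
# Distance between points (0, 0) and (3, 4) should be 5
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))  # 5
```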
The mean distance is about 4.9. To determine the optimal number of clusters we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes: we run the clustering with different values of k and watch what happens to the WCSS as k increases. A sharp drop suggests a good k, but the larger the k, the harder the results may be to interpret. So we check and plot k from 1 to 20, just to see what happens.
set.seed(123) # k-means starts are random, so fix the seed for reproducibility
k_max <- 20
twcs <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})
elbow_data <- data.frame(
k = 1:k_max,
twcs = twcs
)
elbow_plot <- ggplot(elbow_data, aes(x = k, y = twcs)) +
geom_line() + # Connect points with lines
geom_point() + # Add points
scale_x_continuous(breaks = 1:k_max) + # Ensure k values are treated as continuous
labs(x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares",
title = "Determining Optimal k") +
theme_minimal()
elbow_plot
The slope of the plot changes quite sharply at k = 2, so we can use this clustering for the further analysis.
Boston_scaled_km <- kmeans(Boston_scaled, 2)
pairs(Boston_scaled,col = Boston_scaled_km$cluster)
pairs(Boston_scaled[,c(1,10)],col = Boston_scaled_km$cluster)
There seems to be good cluster separation between crime and tax.
pairs(Boston_scaled[,c(3,5)],col = Boston_scaled_km$cluster)
An even better separation for indus and nox: one cluster with high indus and high nox, and one cluster where both values are low.
pairs(Boston_scaled[,c(2,3)],col = Boston_scaled_km$cluster)
There also seems to be good separation for the indus and zn variables, with two clusters: low indus with high zn, and vice versa. As we saw earlier the chosen variables are strongly correlated, and logically should be, so these results make sense.
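That reading can be checked directly against the correlation matrix (a sketch on the raw Boston data; scaling does not change correlations):

```r
library(MASS)
data(Boston)
# indus and nox correlate strongly and positively; zn and indus negatively
round(cor(Boston[, c("indus", "nox", "zn")]), 2)
```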
The elbow plot shows that the last point with clearly decreased WCSS is 6, so I decided to look at what k = 6 clustering looks like, redoing the LDA using the clusters as the target classes.
set.seed(8) # For reproducibility
data("Boston")
Boston_scaled <- as.data.frame(scale(Boston))
Boston_scaled_km <- kmeans(Boston_scaled, centers = 6)
lda.fit <- lda(Boston_scaled_km$cluster ~ . , data = Boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(Boston_scaled_km$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
As we can see, cluster 3 is separate from all the others, and the Charles River variable is the main determinant for this cluster. Cluster 5 is also separate from the others; this separation depends on the radial-roads variable.
Installing/loading plotly:
if (!require(plotly)) {
install.packages("plotly")
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(plotly)
lda.fit <- lda(categorical_crim ~ . , data = train_set)
model_predictors <- dplyr::select(train_set, -categorical_crim)
# check the dimensions
dim(model_predictors)
## [1] 406 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train_set$categorical_crim, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = factor(Boston_scaled_km$cluster[indexes]), type= 'scatter3d', mode='markers')
The crime colouring shows that the high-crime group is clustered separately, together with some med-high observations. The k-means colouring shows clusters 3 and 4 together, as in the 2D plane; however, the 3rd group is split between the two clusters.
The R script transforming the data for next week's assignment is in the repository, with the human.csv file in the same directory. The data wrangling was continued in this script. These are the descriptions; more detailed information can be found here: https://hdr.undp.org/data-center/documentation-and-downloads, first and second table:
Loading additional libraries:
if (!require(FactoMineR)) {
install.packages("FactoMineR")
}
## Loading required package: FactoMineR
library(FactoMineR)
human_data_ <- read.csv(file="/home/alex/ods_course/IODS-project_R/data/human.csv")
human_data_ <- column_to_rownames(human_data_,var ="Country")
Graphical summary:
ggpairs(human_data_, progress = F)
1. All the distributions are skewed, meaning that extreme values are more common for some variables: life expectancy skews high, GNI_C skews low.
2. Many variables are strongly correlated, positively or negatively:
- Life expectancy, education levels and GNI are positively correlated with each other with good significance, and are negatively correlated with the "Maternal Mortality Ratio" and the "Adolescent Birth Rate" with similar significance. This is to be expected.
- Maternal mortality is correlated with adolescent births, which is also to be expected.
- Higher secondary education in females is positively correlated with GNI and life expectancy.
These results are not surprising, as many of the indexes are associated with overall wealth, even though there might be outliers.
Summary:
summary(human_data_)
## FM_2ED_Ratio FM_LFPR_Ratio EduExp LifeExp_B
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7254 1st Qu.:0.5981 1st Qu.:11.22 1st Qu.:66.42
## Median :0.9327 Median :0.7514 Median :13.45 Median :74.00
## Mean :0.8499 Mean :0.7034 Mean :13.13 Mean :71.58
## 3rd Qu.:0.9958 3rd Qu.:0.8525 3rd Qu.:15.20 3rd Qu.:76.95
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI_C MMRatio AdBRate RP_p
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4495 1st Qu.: 13.0 1st Qu.: 13.57 1st Qu.:12.50
## Median : 12081 Median : 55.0 Median : 35.00 Median :19.30
## Mean : 17344 Mean : 150.3 Mean : 47.35 Mean :20.88
## 3rd Qu.: 23112 3rd Qu.: 190.0 3rd Qu.: 71.78 3rd Qu.:27.55
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The data is not normalised, so the variables are not really comparable.
We first perform PCA on the non-standardised data to see why this matters:
human_data_PCA_raw <- prcomp(human_data_)
summary(human_data_PCA_raw)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.821e+04 183.5476 24.84 11.23 3.696 1.539 0.1913 0.1568
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
Almost all of the variance is captured by the first component.
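The "Proportion of Variance" row in that summary is just sdev^2 / sum(sdev^2). A sketch on the built-in USArrests data, which is similarly dominated by one large-scale variable when left unscaled:

```r
# Proportion of variance explained, computed by hand from prcomp()'s sdev
pca <- prcomp(USArrests)
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(prop, 3)  # the first component dominates on unscaled data
```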
biplot(human_data_PCA_raw, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
GNI_C contributes most of PC1. That is because GNI_C takes values up to 123124, far larger than any other variable. The nations with high GNI_C are to the left.
As we saw earlier, the variables are too dissimilar in scale to use for PCA directly, hence scaling:
human_data_scaled <- scale(human_data_)
summary(human_data_scaled)
## FM_2ED_Ratio FM_LFPR_Ratio EduExp LifeExp_B
## Min. :-2.8446 Min. :-2.5987 Min. :-2.7620 Min. :-2.7457
## 1st Qu.:-0.5220 1st Qu.:-0.5286 1st Qu.:-0.6816 1st Qu.:-0.6272
## Median : 0.3476 Median : 0.2408 Median : 0.1131 Median : 0.2937
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6123 3rd Qu.: 0.7484 3rd Qu.: 0.7381 3rd Qu.: 0.6524
## Max. : 2.7134 Max. : 1.6796 Max. : 2.5239 Max. : 1.4487
## GNI_C MMRatio AdBRate RP_p
## Min. :-0.9206 Min. :-0.7127 Min. :-1.1509 Min. :-1.8531
## 1st Qu.:-0.7057 1st Qu.:-0.6554 1st Qu.:-0.8315 1st Qu.:-0.7435
## Median :-0.2891 Median :-0.4549 Median :-0.3041 Median :-0.1398
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3167 3rd Qu.: 0.1896 3rd Qu.: 0.6012 3rd Qu.: 0.5925
## Max. : 5.8094 Max. : 4.5338 Max. : 3.8760 Max. : 3.2512
human_data_PCA_scaled <- prcomp(human_data_scaled)
summary(human_data_PCA_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0705 1.1430 0.87523 0.77704 0.66349 0.53451 0.45617
## Proportion of Variance 0.5359 0.1633 0.09575 0.07547 0.05503 0.03571 0.02601
## Cumulative Proportion 0.5359 0.6992 0.79496 0.87043 0.92546 0.96117 0.98718
## PC8
## Standard deviation 0.32021
## Proportion of Variance 0.01282
## Cumulative Proportion 1.00000
human_data_PCA_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0705291 1.1430478 0.8752283 0.7770364 0.6634878 0.5345100 0.4561665
## [8] 0.3202124
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## FM_2ED_Ratio -0.35624746 0.05544659 -0.258173727 0.61994816 -0.5923317
## FM_LFPR_Ratio 0.05140236 0.72511638 -0.574905230 0.04778188 0.2842048
## EduExp -0.42844344 0.13794051 -0.068952381 -0.06994677 0.1577789
## LifeExp_B -0.44425251 -0.02871318 0.112160198 -0.05130932 0.1525719
## GNI_C -0.35074145 0.05123343 -0.198230209 -0.73273506 -0.4859385
## MMRatio 0.43708962 0.14765550 -0.123634905 -0.25658826 -0.1736624
## AdBRate 0.41035458 0.08931249 0.008796231 0.05375321 -0.4810245
## RP_p -0.08404578 0.64720641 0.728586153 0.01511136 -0.1500635
## PC6 PC7 PC8
## FM_2ED_Ratio 0.19541082 0.057381277 0.16336822
## FM_LFPR_Ratio -0.03316792 -0.226716792 -0.07410761
## EduExp -0.39597156 0.776784601 -0.05176554
## LifeExp_B -0.42259100 -0.439295194 0.62590822
## GNI_C 0.11842241 -0.138379206 -0.16985528
## MMRatio 0.17148338 0.351654675 0.72304994
## AdBRate -0.74968006 -0.078049728 -0.14549600
## RP_p 0.14102220 0.005543132 -0.02360083
PC1 is still the most important component, but now it explains only around 54% of the variance; PC2 captures around 16%, PC3 less than 10%, and the remaining components show diminishing returns.
names_for_the_biplot <- round(summary(human_data_PCA_scaled)$importance[2, ] * 100, digits = 2)
names_for_the_biplot <- paste0(names(names_for_the_biplot), " (", names_for_the_biplot, "%)")
biplot(human_data_PCA_scaled, choices = 1:2, xlab = names_for_the_biplot[1], ylab = names_for_the_biplot[2], cex = c(0.5, 0.7))
These results make a lot more sense and agree with what I originally concluded from the first graphical analysis: 1. PC1 is negatively associated with GNI, life expectancy, education and the ratio of educated females, which are all positively correlated with each other, while increases in maternal mortality and adolescent births increase PC1. 2. PC2 increases with the number of female representatives and the female share of the labour force, which are correlated with each other. As mentioned above, the difference from the unscaled PCA is due to GNI having a much larger influence without scaling; with scaling we can actually see what is happening.
PC1 is related to what is usually described as the development level of a nation. Developed nations have high income, better education and life expectancy, while developing countries and failed states have high adolescent pregnancies and maternal mortality. It is widely believed that female education is the main driving factor in the decrease of the population reproduction index. The second PC shows that there is a connection between female representation in government and overall inclusion in the workforce, but it is not strictly tied to the "PC1 development level", meaning that in poorer countries females can also participate in government and the workforce. Also, in the petrostates of the Middle East, females have lower representation despite high GNI and associated parameters.
library(FactoMineR)
tea_data <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# View(tea_data)
str(tea_data)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Everything is a factor apart from age, but we could convert age to a factor as well; it might be worth it.
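A sketch of that conversion with `cut()`, using break points chosen to match the `age_Q` levels already present in the data (the exact breaks here are an assumption):

```r
# Bin numeric age into the factor levels seen in age_Q
age <- c(15, 23, 32, 48, 90)
cut(age, breaks = c(14, 24, 34, 44, 59, Inf),
    labels = c("15-24", "25-34", "35-44", "45-59", "+60"))
```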
summary(tea_data)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
dim(tea_data)
## [1] 300 36